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Abstract 

The beta-Bernoulli process provides a Bayesian nonparametric prior 
for models involving collections of binary- valued features. A draw from 
the beta process yields an infinite collection of probabilities in the unit 
interval, and a draw from the Bernoulli process turns these into binary- 
valued features. Recent work has provided stick-breaking representations 
for the beta process analogous to the well-known stick-breaking repre- 
sentation for the Dirichlet process. We derive one such stick-breaking 
representation directly from the characterization of the beta process as a 
completely random measure. This approach motivates a three-parameter 
generalization of the beta process, and we study the power laws that can 
be obtained from this generalized beta process. We present a posterior 
inference algorithm for the beta-Bernoulli process that exploits the stick- 
breaking representation, and we present experimental results for a discrete 
factor-analysis model. 



1 Introduction 

Large data sets are often heterogeneous, arising as amalgams from underlying 
sub-populations. The analysis of large data sets thus often involves some form 
of stratification in which groupings are identified that are more homogeneous 
than the original data. While this can sometimes be done on the basis of 
explicit covariates, it is also commonly the case that the groupings arc captured 
via discrete latent variables that are to be inferred as part of the analysis. 
Within a Bayesian framework, there are two widely employed modeling motifs 
for problems of this kind. The first is the Dirichlet-multinomial motif, which is 
based on the assumption that there are K "clusters" that are assumed to be 
mutually exclusive and exhaustive, such that allocations of data to clusters can 
be modeled via a multinomial random variable whose parameter vector is drawn 
from a Dirichlet distribution. A second motif is the beta- Bernoulli motif, where 
a collection of K binary "features" are used to describe the data, and where each 
feature is modeled as a Bernoulli random variable whose parameter is obtained 
from a beta distibution. The latter motif can be converted to the former in 
principle — we can view particular patterns of ones and zeros as defining a cluster, 
thus obtaining M = 2 K clusters in total. But in practice models based on 
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the Dirichlet- multinomial motif typically require O(M) additional parameters 
in the likelihood, whereas those based on the beta-Bernoulli motif typically 
require only O(K) additional parameters. Thus, if the combinatorial structure 
encoded by the binary features captures real structure in the data, then the 
beta-Bernoulli motif can make more efficient usage of its parameters. 

The Dirichlct-multinomial motif can be extended to a stochastic process 
known as the Dirichlet process. A draw from a Dirichlet process is a ran- 
dom probability measure that can be represented as follow s jMcCloskevl . Il965 , 
Patil and Tailliel . Il977l [Fergusonl [l97l ISethuramanl . Il994j : 



(1) 



i=l 



where represents an atomic measure at location ipi, where both the {iti} and 
the are random, and where the {71^} are nonnegative and sum to one (with 
probability one). Conditioning on G and drawing N values independently from 
G yields a collection of K distinct values, where K < N is random and grows 
(in expectation) at rate O(logTV). Treating these distinct values as indices of 
clusters, we obtain a model in which the number of clusters is random and 
subject to posterior inference. 

A great deal is known about the Dirichlet process — there are direct connec- 
tions between properties of G as a random measure (e.g., it can be obtained 
from a Poisson point process), properties of the sequence of values {^i} (they 
can be obtained from a "stick-breaking process" ) , and properties of the collec- 
tion of distinct values obtained by sampling from G (they are characterized by a 
stochastic process known as the Chinese restaurant process). These connections 
have helped to place the Dirichlet process at the center of Bayesian nonparamet- 
rics, driving the development of a wide variety of inference algorithms for models 
based on Dirichlet process priors and suggesting a range of generalizatio ns fe 
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It is also possible to extend the beta-Bernoulli motif to a Bayesian nonpara- 
metric framework, and there is a growing literature on this topic. The underlying 
stochastic process is the beta process, which is a n instance of a fa mily of random 
measures known as completely random measures [Kingma nl ll967l. The b eta pro- 
cess was first studied in the context of survival analysis bv lHiortl |1990| . where 
the focus is on modeling hazard functions via the rand om cumulative distri- 
bution function obtained by integrating the beta process. iThibaux and Jordanl 



20071 ] focused instead on the beta process realization itself, which can be rep- 



resented as 



G 



where both the qi and the ifii are random and where the qi are contained in the 
interval (0, 1). This random measure can be viewed as furnishing an infinite col- 
lection of coins, which, when tossed repeatedly, yield a binary featural descrip- 
tion of a set of entities in which the number of features with non-zero values is 
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random. Thus, the resulting beta-Bernoulli process can be viewed as an infinite- 
dimen sional version of the beta-Bernoulli motif. Indeed, Thibaux and JordanI 
20071 ] showed that by integrating out the random q, and ipi one obtains — by 



analogy to the derivation of the Chinese restaurant process from the Dirichlct 
process — a combinator ial stochastic process know n as th e Indian buffet process, 
previously studied by I Griffiths and Ghahramani (2006| , who derived it via a 
limiting process involving random binary matrices obtained by sampling finite 
collections of beta-Bernoulli variables. 

Stick-breaking representations of the Dirichlet process have been particu- 
larly important both for algorithmic development and for exploring general- 
izations of the Dirichlet process. These representations yield explicit recursive 
formulas for obtaining the weights {71^} in Eq. ((T|). In the case of the beta 
process, explicit non- recursive represe ntations can be obtained fo r the weights 
{qi}, based o n size-biased sampling IThibaux and JordanI 120071 ] and inverse 
Levy measure [Wolpert and Ickstadtl . feOOi iTeh et all |2007| . Recent work has 
also yielded recursive constructions that are m ore closely related to the stick 



break ing representation of the Dirichlet process [Teh et al 
2010j . 



2007, iPaislev et al 



Stick-breaking representations of the Dirichlet process permit ready general- 
izations to stochastic processes that yield power-law beha vior (which the Dirich- 
let p r ocess does not ), notably the Pitman- Yor process [ishwaran and Jamesl 



200ll|Pitmanll2006l. Power- l aw ge neralizations of the beta process have also 



a I 

been studied ITeh and Goriirl . l2009j and stick-breaking-like representations de- 
rived. These latter representations are, however, based on the non-recursive 
sized-biased sa mpling and inver se-L evy methods rather than the recursive rep- 



res entations o f Teh et al.l 2007 1 and Paisley et al. 20 ldj | . 



Teh et all {2007| and lPaislev et al.l [2010| derived their stick-breaking repre- 



sentations of the beta process as li miting processes, making use of t he deriva- 
tion of the Indian buffet process by Griffiths and Ghahrama ni [2006] as a limit 
of finite-dimensional random matrices. In the current paper we show how to 
derive stick-breaking for the beta process directly from the underlying random 
measure. This approach not only has the advantage of conceptual clarity (our 
derivation is elementary), but it also permits a unified perspective on various 
generalizations of the beta process that yield power-law behavior^ We show in 
particular t hat it yields a power -law generalization of the stick-breaking repre- 



sentation of IPaislev et al 



power- 
ful]. 



To illustrate our results in the context of a concre te application, we study a 
discre te fa ctor analysis model p reviously considered bv lGriffiths and Ghahramani 



2006| and lPaislev et al.l [2010| . The model is of the form 



X = Z$ + E, 



(2) 



where X € M. NxP is the data and E € M. NxP is an error matrix. The matrix 
$ G M KxP is a matrix of factors, and Z <G R NxK is a binary matrix of factor 



1 A similar measure-theoretic derivation has been presented recently bv lPaislev et alj [201ll | , 
who focus on applications to truncations of the beta process. 
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loadings. The dimension K is infinite, and thus the rows of comprise an 
infinite collection of factors. The matrix Z is obtained via a draw from a beta- 
Bernoulli process; its nth row is an infinite binary vector of features (i.e., factor 
loadings) encoding which of the infinite collection of factors are used in modeling 
the nth data point. 

The remainder of the paper is organized as follows. We introduce the beta 
process, and its conjugate measure the Bernoulli process, in Section [2j In order 
to consider stick-breaking and power law behavior in the beta-Bernoulli frame- 
work, we first review stick-breaking for the Dirichlet process in Section [3] and 
power laws in clustering models in Section l4.ll We consider potential power laws 
that might exist in featural models in Section 14.21 Our main theoretical results 
come in the following two sections. First, in Section [5] we provide a proof that 
the stick-breaking representation of IPaislev et al.l [2010(, expanded to include a 



third parameter, holds for a three-parameter extension of the beta process. Our 
proof takes a measure-theoretic approach based on a Poisson process. We then 
make use of the Poisson process framework to establish asymptotic power laws, 
with exact constants, for the three-parameter beta process in Section 16.11 We 
also show, in Section 16721 that there are aspects of the beta-Bernoulli framework 
that cannot exhibit a power law. We illustrate the asymptotic power laws on a 
simulated data set in Section We present experimental results in Section [51 
and we present an MCMC algorithm for posterior inference in Appendix [75] 



2 The beta process and the Bernoulli process 

The beta process and the Bernoulli process are instances of t he general family 



of random measures known as completely random measures [Kingmanl . Il967j | . 



A completely random measure H on a probability space {^,S) is a random 
measure such that, for any disjoint measurable sets A\, . . . , A n € S, the random 
variables H{A\) 1 . . . , H{A n ) are independent. 

Completely random measures can be obtained from an underlying Poisson 
point process. Let v(dip,du) denote a c-finite measur^l on the product space 
\& x R. Draw a realization from a Poisson point process with rate measure 
v(dip,du). This yields a set of points II = {(ipi,Ui)}i, where the index i may 
range over a countable infinity. Finally, construct a random measure as follows: 

oo 

1=1 

where 8^, i denotes an atom at ipi. This discrete random measure is such that 
for any measurable set T € S, 

B(T)= J2 



2 The measure v need not necessarily be tr-finite to generate a completely random measure 
though we consider only cr-finite measures in this work. 
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Figure 1: The gray surface illustrates the rate density in Eq. ((4]) corresponding 
to the beta process. The base measure Bq is taken to be uniform on vff. The non- 
zero endpoints of the line segments plotted below the surface are a particular 
realization of the Poisson process, and the line segments themselves represent a 
realization of the beta process. 



That B is completely random follows from the Poisson point process construc- 
tion. 

In addition to the representation obtained from a Poisson process, completely 
random measures may include a deterministic measure and a set of atoms at 
fixed locations. The component of the completely random measure generated 
from a Poisson po int proces s as d escribed above is called the ordinary compo- 
nent. As shown bv lKingman 19621, completely random measures are essentially 



characterized by this representation. An example is shown in Figure [TJ 

The beta process, denoted B ~ BP(#,i?o), is an example of a completely 
random measure. As long as the base measure Bq is continuous, which is our 
assumption here, B has only an ordinary component with rate measure 

ubp (dip, du) = 6(i>)u~ l (i - u) ^- 1 du B {di>), ^e*,«e[0,l], (4) 

where 6 i s a po s itive function on *5f. The function 9 is called the concen t ration 
function |Hjort . 1990]. In the remainder we follow iThibaux and Jordan! |2007i | 



in taking 6 to be a real-valued constant and refer to it as the concentration 
parameter. We assume -Bo is nonnegative and fixed. The total mass of Bq, 
7 := Bq(^>), is called the mass parameter. We assume 7 is strictly positive 
and finite. The density in Eq. with the choice of Bq uniform over [0, 1], is 
illustrated in Figure [TJ 

The beta process can be viewed as providing an infinite collection of coin- 
tossing probabilities. Tossing these coins corresponds to a draw from the Bernoulli 
process, yielding an infinite binary vector that we will treat as a latent feature 
vector. 

More formally, a Bernoulli process Y ~ BeP(B) is a completely random 
measure with potentially both fixed atomic and ordinary components. In defin- 
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Figure 2: Upper left: A draw B from the beta process. Lower left: 50 draws 
from the Bernoulli process BeP(B). The vertical axis indexes the draw number 
among the 50 exchangeable draws. A point indicates a one at the corresponding 
location on the horizontal axis, ip E \&. Right: We can form a matrix from the 
lower left plot by including only those ip values with a non-zero number of 
Bernoulli successes among the 50 draws from the Bernoulli process. Then, the 
number of columns K is the number of such ip, and the number of rows N is 
the number of draws made. A black square indicates a one at the corresponding 
matrix position; a white square indicates a zero. 
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ing the Bernoulli process we consider only the case in which B is discrete, i.e., of 
the form in Eq. ([3]), though not necessarily a beta process draw or even random 
for the moment. Then Y has only a fixed atomic component and has the form 



y = 5><y*, (5) 



where hi ~ Bern(ui) for m the corresponding atomic mass in the measure B. 
We can see that E(F|S) = i?( v &) from the mean of the Bernoulli distribution, 
so the number of non-zero points in any realization of the Bernoulli process is 
finite when B is a finite measure. 

We can link the beta process and TV Bernoulli process draws to generate 
a random feature matrix Z. To that end, first draw B ~ BP(8,Bo) for fixed 

hyperparameters 6 and Bq and then draw Y n *~ BeP(-B) for n G {1, . . . , N}. 
Note that since B is discrete, each Y n will be discrete as in Eq. ([5]), with point 
masses only at the atoms {ipi} of the beta process B. Note also that KB(^f) = 
7 < oo, so B is a finite measure, and it follows that the number of non-zero 
point masses in any draw Y n from the Bernoulli process will be finite. Therefore, 
the total number of non-zero point masses K across N such Bernoulli process 
draws is finite. 

Now reorder the {ipi} so that the first K are exactly those locations where 
some Bernoulli process in {Y n }^ =1 has a non-zero point mass. We can form a 
matrix Z £ {0, \ J NxK as a function of the {Y n }^ =1 by letting the (n,k) entry 
equal one when Y n has a non-zero point mass at ip^ and zero otherwise. If 
we wish to think of Z as having an infinite number of columns, the remaining 
columns represent the point masses of the {Y n }^ =1 at {ipk}k>K, which we know 
to be zero by construction. We refer to the overall procedure of drawing Z ac- 
cording to, first, a beta process and then repeated Bernoulli process draws in this 
way as a beta- Bernoulli process, and we write Z ~ BP-BcP(7V, 7, 9). Note that 
we have implicitly integrated out the {ipk}, and the distrib ution of the matrix Z 
depen ds on Bq only through its total mass, 7. As shown bv lThibaux and Jordan 



20071 ] . this process yields the same distribut ion on row-exchangeable, infinite - 
column matrices as the Indian buffet process [Griffiths and Ghahramaml |2006| , 
which describes a stochastic process directly on (equivalence classes of) binary 
matrices. That is, the Indian buffet process is obtained as an exchangeable 
distribution on binary matrices when the underlying beta process measure is 
integrated out. This result is analogous to the derivation of the Chinese restau- 
rant process as the exchangeable distribution on partitions obtained when the 
underlying Dirichlet process is integrated out. The beta-Bernoulli process is 
illustrated in Figured 

3 Stick-breaking for the Dirichlet process 



The stick-breaking representation of the Dirichlet process (McCloskevl . 1965 



Patil and Taillid . 119771 ISethuramanl |1994{ provides a simple recursive procedure 
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Figure 3: A stick-breaking process starts with the unit interval (far left). First, 
a random fraction V\ of the unit interval is broken off; the remaining stick has 
length 1 — V\ (middle left). Next, a random fraction V2 of the remaining stick 
is broken off, i.e., a fragment of size V^(l — Vi); the remaining stick has length 
(1 — Vi)(l — V?). This process proceeds recursively and generates stick fragments 

V\ , V2 ( 1 — V\ ) , . . . , Vi Y\ ■ <i ( 1 — Vj ) , These fragments form a random partition 

of the unit interval (far right). 



for obtaining the weights {tti} in Eq. ([T]). This procedure provides an explicit 
representation of a draw G from the Dirichlet process, one which can be usefully 
instantiated and updat e d in p osterior inference algorithms Ishwaran and James , 



2001 , Blei and Jordan . 20061 ] . We begin this section by reviewing this stick 



breaking construction as well as some of the extensions to this construction that 
yield power-law behavior. We then turn to a consideration of stick-breaking and 
power laws in the setting of the beta process. 

Stick-breaking is the process of recursively breaking off random fractions of 
the unit interval. In particular, let Vi,V2,... be some countable sequence of 
random variables, each with range [0, 1]. Each V represents the fraction of the 
remaining stick to break off at step i. Thus, the first stick length generated by 
the stick-breaking process is V\. At this point, a fragment of length 1 — V\ of 
the original stick remains. Breaking off V% fraction of the remaining stick yields 
a second stick fragment of V%(1 — Vi). This process iterates such that the stick 
length broken off at step i is VifJ-^fl — Vj). The stick-breaking recursion is 
illustrated in Figure [3J 

The Dirichlet process arises from the spe cial case in which the Vj are indepen- 
dent draws from the Bet a(l, 0) distribution [McCloskev . 1965 . Patil and Taillie . 
19771 ISethuramanl Il994 |. Thus we have the following representation of a draw 
G~DP(0,G o ): 



G 



E 



J'=l 



Vi ~ Beta(l,0) 
ipi ~ &0, 



(6) 



where Go is referred to as the base measure and 9 is referred to as the concen- 
tration parameter. 



4 Power law behavior 

Consider the process of sampling a random measure G from a Dirichlet pro- 
cess and subsequently drawing independently N times from G. The number of 
unique atoms sampled according to this process will grow as a function of N. 
The growth associated with the Dirichlet process is relatively slow, however, 
and when the Dirichlet process is used as a prior in a clustering model one does 
not obtain the heavy-tailed behavior commonly referred to as a "power law." 
In this section we first provide a brief exposition of the different kinds of power 
law that we might wish to obtain in a clustering model and discuss how these 
laws can be obtained via an extension of the stick-breaking representation. We 
then discuss analogous laws for featural models. 



4.1 Power laws in clustering models 

First, we establish some notation. Given a number N of draws from a discrete 
random probability measure G (where G is not necessarily a draw from the 
Dirichlet process), let (Ni, N%, . . .) denote the sequence of counts associated 
with the unique values obtained among the N draws, where we view these 
unique values as "clusters." Let 



K N>j = £ l(Ni = j), 



(7) 



i=l 



and let 



K N = Y, UNi > 0). 



(8) 



i=i 



That is, Kjsf j is the number of clusters that are drawn exactly j times, and Kjy 
is the total number of clusters. 

There are two types of power-law behavior that a clustering model might 
exhi bit. First, there is the type of p ower law behavior reminiscent of Heaps' 
law [Heapsl . Il978l iGnedin et alll2007| : 



K 



N 



cN a , N oo 



(9) 



for some constants c > 0, a £ (0, 1). Here, ~ means that the limit of the ratio 
of the left-hand and right-hand side, when they are both real-valued and non- 
random, is one as the number of data points ./V grows large. We denote a power 
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law in the form of Eq. © as Typ e I. Second, there is the type o f power law 
behavior reminiscent of Zipf 's law (Zipl I1949L TGnedin et all l2007j : 



(10) 



again for some constants c > 0, a G (0, 1). We refer to the power law in Eq. ([TI 
as Type II. 

Sometimes in the case of Eq. (jlOp. we are interested in the behavior in j; 
therefore we recall 7! = r(7 + l) and note the foll owing fact about the T- function 
ratio in Eq. (pD [cf. iTricomi and Erdelvil Il95l| : 



r(j - a) 
r(j + 1) 



j 



(ii) 



Again, we see behavior in the form of a power law at work. 

Power-law beha vior of Types I and II [and equivalent formulations; see 



Gnedin et all l2007j | has been observed in a variety of real- world clustering prob- 



lems including, but not limited to: the number of species per plant genus, the 
in-degrec or out-degree of a graph constructed from hyperlinks on the Inter- 
net, the number of people in cities, the number of words in documents, the 
number of papers publ i shed by scientists, and the am ount each person earns in 



Mitzenmacherl 12004 iGoldwater et al. Baycsians modeling these 



situations will prefer a prior that reflects this distributional attribute 

While the Dirichlet process exhibits neither type of power-law behavior . 
the Pitman- Yor proces s yields both kinds of power law Pitman and Yoii 11997 



IGoldwater et all 120061 ] though we note that in this case c is a random variable 
(still with no dependence on N or j). The Pitman- Yor process, denoted G ~ 
PY(0, a, Go), is defined via the following stick-breaking representation: 



G 

Vi 



E 



ViHo—Vj] 
3=1 



Beta(l — a, 9 + ia) 



< iid ^ 

Vi ~ G , 



(12) 



where a is known as a discount parameter. The case a = returns the Dirichlet 
process (cf. Eq. (O). 

Note that in both the Dirichlet process and Pitman- Yor process, th e weights 
Wi T~ n~i (1 — Vj)} are the weights of the process in size- biased order Pitmanl . 
200(jj. In the Pitman- Yor case, the {Vi} are no longer identically distributed. 



4.2 Power laws in featural models 

The beta-Bernoulli process provides a specific kind of feature-based representa- 
tion of entities. In this section we study general featural models and consider 
the power laws that might arise for such models. 
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In the clustering framework, we considered N draws from a process that put 
exactly one mass of size one on some value in and mass zero elsewhere. In 
the featural framework we consider N draws from a process that places some 
non-negative integer number of masses, each of size one, on an almost surely 
finite set of values in and mass zero elsewhere. As Ni was the sum of masses 
at a point labeled ^ 6 f in the clustering framework, so do we now let TV, be 
the sum of masses at a point labeled ipi € "J. We use the same notation as in 
Section FfTTl but now we note that the counts Ni no longer sum to TV in general. 

In the case of featural models, we can still talk about Type I and II power 
laws, both of which have the same interpretation as in the case of clustering 
models. In the featural case, however, it is also possible to consider a third type 
of power law. If we let k n denote the number of features present in the nth 
draw, we say that k n shows power law behavior if 

¥(k n > M) ~ cM~ a 

for positive constants c and a. We call this last type of power law Type III. 



5 Stick-breaking for the beta process 



The weights {qi} for the beta pr ocess can be derived by a va riety of procedures, 
includin g size-biased sampling Thibaux and Jord"arl 12007 and inverse Levy 
measure [Wolpert and Ickstadt . 2004 Teh et all 20071 ]. The procedures that are 
closest in spi rit to the stick-break ing r epresentat i on for the Dirichlet process are 



those due to Paisley et al. 201d| and Teh et al. 2007 1. Our point of departure 



is the former, which has the following form: 



oc Ci i— 1 

i=i j=i 



i=i 



Ci 



Pois(7) 



VP") ~ Beta(l, 



id 1 R 
~ --DO- 

7 



(13) 



This representation is analogous to the stick-breaking representation of the 
Dirichlet process in that it represents a draw from the beta process as a sum 
over independently drawn atoms, with the weights obtained by a recursive pro- 
cedure. However, it is worth noting that for every tuple subscript for V®, 
a different stick exists and is broken across the superscript I. Thus, there are no 
special additive properties across weights in the sum in Eq. (|13|) ; by contrast, 
the weights in Eq. (|12p sum to one almost surely. 

The generalization of the one-parameter Dirichlet process to the two-parameter 
Pitman- Yor process suggests that we might consider generalizing the stick- 
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breaking representation of the beta process in Eq. (fT5|) as follows: 



oo d i—X 

i=i j=i i=i 

d ~ Pois( 7 ) 

^ ~ -Bo- (14) 

7 

In Section [S] we will show that introducing the additional parameter a indeed 
yields Type I and II power law behavior (but not Type III). 

In the remainder of this section we present a proof that these stick-breaking 
representations arise from the beta process. In contradistinction to the proof 
of Eq. (HU) by IPaislev et all [2010j . which used a limiting process defined on 



sequences of finite binary matrices, our approach makes a direct connection to 
the Poisson process characterization of the beta process. Our proof has several 
virtues: (1) it relies on no asymptotic arguments and instead comes entirely from 
the Poisson process representation; (2) it is, as a result, simpler and shorter; 
and (3) it demonstrates clearly the ease of incorporating a third parameter 
analogous to the discount parameter of the Pitman- Yor process and thereby 
provides a strong motivation for the extended stick-breaking representation in 
Eq. (HI. 

Aiming toward the general stick-breaking representation in Eq. J)14[ ). we 
begin by defining a three-parameter generalization of the beta process|j We say 
that B ~ BP(#, a, B$), where we call a a discount parameter, if, for tp £ ^, u £ 
[0, 1]), we have 

v BP (diP,du) = r(1 ^ ^-"(l-u)'-**- 1 duB Q (diP). (15) 
1(1 — a)L [0 + a) 

It is straightforward to show that this three-parameter density has similar 
properties to that of the two-parameter beta process. For instance, choosing 
a £ (0, 1) and 9 > —a is necessary for the beta process to have finite total mass 
almost surely; in this case, 

r(i-tt)r(0 + a) 

u VBp{dip, du) = — < oo. (16) 

'*xl + L \ L + V) 

We now turn to the main result of this section. 

Proposition 1. B can be represented according to the process described in 
Eq. HB if and only if B - BP(6, a, B ). 



3 See also iTeh and GoruJ [20091 ] or lKim and Led 1200 ill, with 0( t) = 1 - a,P(t) 
where the left-hand sides are in the notation of lKimand L3 1200111 . 
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Proof. First note that the points in the set 

Pi ■= {(i>i,i,v$), (i&i )2l v$), ■ • ■ , (to, vffij} 

are by construction independent and identically distributed conditioned on Ci. 
Since Ci is Poisson-distributed, Pi is a Poisson point process. The same logic 
gives that in general, for 

^ ■■= \ U+rii nV - v®)), . . . , L, Ci ri% ii(i - ) , 



1=1 



1=1 



Pi is a Poisson point process. 
Next, define 



P:=(Jfl 



As the countable union of Poisson processes with finite rate measures, P is itself 
a Poisson point process. 



Notice that we can write B as the completely random measure B = Y 



Also, for any B' ~ BP(#, d, Bo), we can write B' = Y^N,' U')en U'S^>, where II is 
Poisson point process with rate measure vbp = Bq x /^bp , and /ibp is a a- finite 
measure with density 



T(l -a)T(9 + a) 



du. 



(17) 



Therefore, to show that B has the same distribution as B' , it is enough to show 
that P and II have the same rate measures. 

To that end, let v denote the rate measure of P: 

v{A x A) = E#{(V>i, U t ) e Ax A)} 

oo d i—1 

= ±B (A) -E££ 1{V<*> IJ(1 - Vfp) e A} 

I i=\ j = l 1=1 

1 oo d i—1 

-Bo(A) • ]T E E ~ O e A h (18) 

i=l j=l 1=1 



7 



where the last line follows by monotone convergence. Each term in the outer 
sum can be further decomposed as 



EE 1 «i ) IK 1 -^? ) ) ei } = 

3=1 



3=1 *=1 
i-1 

E[ci]E iftfna-^)^} 



i-1 



2=1 
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since the are iid across j and independent of Cj 

i-l 



= 7 E1{^JJ(1-Vi) el} 

1=1 

for Vi m ~ p Beta(l - a, + ia), 



(19) 



where the last equality follows since the choice of {V,} gives V, IIi=i(l — Vi) = 

v^lUzKi-v^). 

Substituting Eq. (|19p back into Eq. (|18[) . canceling 7 factors, and applying 
monotone convergence again yields 

00 i—l 

v(A x A) = B (A) ■ EJ2 HVi - V) e A}. 

i=l i=l 

We note that both of the measures v and i'bp factorize: 

00 i—l 

u(AxA) = B (A)-Ej2HVll[(l-V l ')GA} 

j=i i=i 
zy BP (ylxi) = B a (A)fi BP (A), 

so it is enough to show that [i = ^bp for the measure p defined by 



(A) :=Ej2i{vY[(l~V)eA}. 



i=l 



1=1 



(20) 



At this point and later in proving Proposi tion El we will m ake use of part of 
Campbell's theorem, which we copy here from lKingman 1993 for completeness. 

Theorem 2 (Part of Campbell's Theorem). Let II be a Poisson process on S 
with rate measure p., and let f : S —> M be measurable. If J s min(|/(a;)|, 1) fi(dx) < 
00, then 



E 



£/(*) 



Lxen 



= / f(x) fJ.(dx). 



(21) 



Now let U be a size-biased pick from {Vi YYi=i(^ ~ Vi)}i=-i- By construction, 



for any bounded, measurable function g, we have 

00 i—l 



3 



i=l 1=1 



1=1 



Taking expectations yields 

00 i—l 



Eg(U) = E 



i-l 



E^IR 1 - W'lIC 1 -^)) 



U=l 1=1 



1=1 



ug(u)n(du), 
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where the final equality follows by Campbell's theorem with the choice f(u) = 
ug(u). Since this result holds for all bounded, measurable g, we have that 



F(U e du) = ufi(du). (22) 

Finally, we note that, by Eq. (|2"0)) , U is a size-biased sample from prob- 
abilities generated by stick-breaking with proportions {Beta(l — a, 9 + ia)}. 
Such a sample is then distributed Beta(l — a, 9 + a) since, as mentioned above, 
the Pitman- Yor stick-breaking construction gives the size-biased frequencies in 
order. So, rearranging Eq. (|22[) . we can write 



fj,(du) = u _1 P(i/ E du) 

-u^-^-^l-u) 



1 r(l + 0) ..(l-a)-l/ ,_ s(fl+a)-l 



r(i - a)r(0 + a) 
using the Beta(l — a, 9 + a) density 
= ^ B p (du), 

as was to be shown. □ 



6 Power law derivations 

By linking the three-parameter stick-breaking representation to the power-law 
beta process in Eq. (fT5|). we can use the results of the following section to 
conclude that the feature assignments in the three-parameter model follow both 
Type I and Type II power laws a nd that they do not fo llow a Type III power 
law (Section S3). We note that iTeh and Goriirl |2009| found big-0 behavior 



for Types I and II in the three-parameter Beta and a Poisson distribution for 
the Type III distribution. We can strengthen these results to obtain exact 
asymptotic behavior with constants in the first two cases and also conclude 
that Type III power laws can never hold in the featural framework whenever 
the sum of the feature probabilities is almost surely finite, an assumption that 
would appear to be a necessary component of any physically realistic model. 

6.1 Type I and II power laws 



Our subsequent derivation expands upon the work of iGnedin et al.1 [20071 ]. In 
that paper, the main thrust of the argument applies to the case in which the 
feature probabilities are fixed rather than random. In what follows, we obtain 
power laws of Type I and II in the case in which the feature probabilities are ran- 
dom, in particular when the probabilities are generated from a Poisson process. 
We will see that this last assumption becomes convenient in the course of the 
proof. Finally, we apply our results to the specific example of the beta-Bernoulli 
process. 

Recall that we defined Kjy, the number of represented clusters in the first 
N data points, and K^j, the number of clusters represented j times in the first 
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N data points, in Eqs. © and (J7J), respectively. In Section POl we noted that 
same definitions in Eqs. (J5J and ([7]) hold for featural models if we now let Ni 
denote the number of data points at time N in which feature % is represented. 
In terms of the Bernoulli process, Ni would be the number of Bernoulli process 
draws, out of N, where the ith atom has unit (i.e., nonzero) weight. It need not 
be the case that the JVj sum to N. 

Working directly to find power laws in Kn and K^,j as N increases is 
challenging in part due to N being an integer. A standard technique to surmount 
this difficulty is called Poissonization. In Poissonizing Kn and Knj, we consider 
new functions K(t) and Kj(t) where the argument t is continuous, in contrast 
to the integer argument N. We will define Kit) and Kj{t) such that K(N) and 
Kj{N) have the same asymptotic behavior as Kjy and Knj, respectively. 

In particular, our derivation of the asymptotic behavior of Kn and K^j 
will consist of three parts and will involve working extensively with the mean 
feature counts 

$ w := E[K N ] and $ Nd := E[K Njj ] (j > 1) 

with N E {1,2,...} and the Poissonized mean feature counts 

:= E[A'(i)] and := E[Kj(t)] (j > 1). 

with t > 0. First, we will take advantage of Poissonization to find power laws in 
<3?(i) and $j(t) as t —> oo (Proposition [3]). Then, in order to relate these results 
back to the original process, we will show that <I>jv and $(iV) have the same 
asymptotic behavior and also that $jv,j and $>j(N) have the same asymptotic 
behavior (Lemma [5| . Finally, to obtain results for the random process values 
K^ and Kn,], we will conclude by showing that Kn almost surely has the 
same asymptotic behavior as $jv and that X)fc<j ^N,k almost surely has the 
same asymptotic behavior as J2k<j ^N,k (Proposition [5]). 

To obtain power laws for the Poissonized process, we must begin by defining 
Kit) and Kj(t). To do so, we will construct Poisson processes on the positive 
half-line, one for each feature. K(t) will be the number of such Poisson processes 
with points in the interval [0,i\; similarly, Kjit) will be the number of Poisson 
processes with j points in the interval [0,t]. This construction is illustrated in 
Figure |31 It remains to specify the rates of these Poisson processes. 

Let (gi, q2, ■ ■ ■) be a countably infinite vector of feature probabilities. We 
begin by putting minimal restrictions on the q,;. We assume that they are 
strictly positive, decreasing real numbers. They need not necessarily sum to 
one, and they may be random. Indeed, we will eventually consider the case 
where the qi are the (random) atom weights of a beta process, and then we will 
have Yli Qi 1 with probability one. 

Let be a standard Poisson process on the positive real line generated with 
rate qi (see, e.g., the top five lines in Figure |4}. Then n := \J i IT is a standard 
Poisson process on the positive real line with rate qi (see, e.g., the lowermost 
line in Figure 2]), where we henceforth assume J2i Qi < 00 a -S. 
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Figure 4: The first five sets of points, starting from the top of the figure, il- 
lustrate Poisson processes on the positive half-line in the range t € (0, 5) with 
respective rates Qi,...,qs. The bottom set of points illustrates the union of 
all points from the preceding Poisson point processes and is, therefore, itself 
a Poisson process with rate YliQi- ^ n this example, we have for instance that 
K(l) = 2, K{4) = 5, and K 2 (A) = 1. 



Finally, as mentioned above, we define K(t) to be the number of Poisson 
processes IT with any points in [0,i]: 

K(t) :=Y,H\nin[o,t]\>o}. 

i 

And we define Kj(t) to be the number of Poisson processes IT with exactly j 
points in [0,i]: 

Kj(t) :=^i{|n 4 n[o,t]|=j}. 

i 

These definitions are very similar to the definitions of Kn and Ifjv j in Eqs. (|8]l 
and 0) respectively. The principal difference is that the Kn are incremented 
only at integer N whereas the K(t) can have jumps at any t G M+. The same 
observation holds for the ifjvj an d Kj{t). 

In addition to Poissonizing Kn and K^,j to define K(t) and Kj{t), we will 
also find it convenient to assume that the {qi\ themselves are derived from a 
Poisson process with rate measure v. We note that Poissonizing from a discrete 
index N to a continuous time index t is an approximation and separate from 
our assumption that the {qi} are generated from a Poisson process though both 
are fundamentally tied to the ease of working with Poisson processes. 

We are now able to write out the mean feature counts in both the Poissonized 
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and original cases. First, the Poissonized definitions of $ and K allow us to write 
$(f) :=E[K{t)] = E[E[K(t)\q]] = E[E[^ 1{|IL H [0, t]\ > 0}\q}]. 

i 

With a similar approach for 3>j(£), we find 

S(f) = E£(l - e- 4 *)], = E[£ Mie"*« 



With the assumption that the {qi} are drawn from a Poisson process with 
measure measure v, we can apply Campbell's theorem (Theorem [5]) to both the 
original and Poissonized versions of the process to derive the final equality in 
each of the following lines 

*(*) = E£(l - e-^)] = [\l - e- te ) v(dx) (23) 

Jo 

<!> N = E£(l - (1 - qi )»)] = f\l - (1 - x) w ) i/(dx) (24) 

$ 3 (t) = EE ^e-*«'] = ^e- te i/(da?) (25) 



Now we establish our first result, which gives a power law in <fr(t) and &j(t) 
when the Poisson process rate measure v has corresponding power law proper- 
ties. 

Proposition 3. Asymptotic behavior of the integral of is of the following form 

i/i[0,a;]:=/ uv{du) — x 1_a Z(l/x), x -> (27) 

Jo 1 ^ " 

where I is a regularly varying function and a £ (0, 1) implies 
$(t) - r(l -a)t a l(t), t^oo 

„ »r(j -a) faf(f)i i ^ oo (j>1). 

Proof. The key to this result is in the repeated use of Abelian or Tauberian 
theorems. Let i be a map A : F — >• G from one function space to another: 
e.g., an integral or a Laplace transform. For / G F, an Abelian theorem gives 
us the asymptotic behavior of A(f) from the asymptotic behavior of /, and a 
Tauberian theorem gives us the asymptotic behavior of / from that of A(f). 
First, integrating by parts yields 

rx />oo 

z/i[0,x] = —xv(x) + / v(u) du, v(x) := / v(u) du, 
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so the stated asymp totic behavior in v\ yields y(x) ~ l(l/x)x a (x — > 0) by a 
Tauberian theorem [Fellerl Il966l IGnedin et all . |2007| where the map A is an 



integral. 

Second, another integration by parts yields 



$(t) = t / e- tx D(x) dx. 
Jo 

The desired asymptotic be havior in <1> follows from the asy mptotic behavior in 
v and an Abelian theorem Feller , 19661 IGnedin et al. , 2007 [ where the map A is 



a Laplace transform. The result for $ ^ (t) follows from a similar argument when 
we note that repeated integration by parts of Eq. (|2"5j) also yields a Laplace 
transform. □ 

The importance of assuming that the qi are distributed according to a Pois- 
son process is that this assumption allowed us to write $ as an integral and 
thereby make use of classic Abelian and Tauberian theorems. The importance 
of Poissonizing the processes Kj and Kn,j is that we can write their means as 
in Eqs. (|23|) and ([25]) . which are — up to integration by parts — in the form of 
Laplace transforms. 

Proposition [3] is the most significant link in the chain of results needed to 
show asymptotic behavior of the feature counts Kn and Kjyj m that it relates 
power laws in the known feature probability rate measure v to power laws in 
the mean behavior of the Poissonizcd version of these processes. It remains to 
show this mean behavior translates back to Kn and Kjsrj, first by relating the 
means of the original and Poissonizcd processes and then by relating the means 
to the almost sure behavior of the counts. The next two lemmas address the 
former concern. Together they establish that the mean feature counts $at and 
<3?jv,j have the same asymptotic behavior as the corresponding Poissonized mean 
feature counts $(AT) and $j(N). 

Lemma 4. Let v be a -finite with v(du) = oo and J °° u v(du) < oo. Then 
the number of represented features has unbounded growth almost surely. The 
expected number of represented features has unbounded growth, and the expected 
number of features has sublinear growth. That is, 

K(t) t oo a.s., $(f) t oo, $(t) < t. 



Proof. As in IGnedin et al.l [2007 . the first statement follows from the fact that 



q is countably infinite and each qi is strictly positive. The second statement 
follows from monotone convergence. The final statement is a consequence of 
J2i Qi < 00 a - s - □ 

Lemma 5. Suppose the {qi} are generated according to a Poisson process with 
rate measure as in Lemma^ Then, for N — > oo, 
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I* 



N,j 



<jJmax{$ j (JV) ) $ i+J (iV)} -> 0. 



for some constants cj. 

Proof. The proof is the same as that of Lemma 1 of lGnedin et al. I j2007j . Estab- 
lishing the inequalities results from algebraic manipulations. The convergence 
to zero is a consequence of Lemma |4] □ 

Finally, before considering the specific case of the three-parameter beta pro- 
cess, we wish to show that power laws in the means $at and $jv,j extend to 
almost sure power laws in the number of represented features. 

Proposition 6. Suppose the {qi} are generated from a Poisson process with 
rate measure as in Lemma^4\ For N oo, 



k<j 

Proof. We wish to show that Kn/$n - 
cnough to show that, for any e > 0, 



k<j 



N.k- 



1 as TV oo. By Borcl-Cantelli, it is 



N 



N 



- 1 



A' 



> e < oo. 



To that end, note 

P{\K N - $ N \ > e$ N ) < P($at > e$ N + K N )+V(K N > e<S> N 
The note after Theorem 4 in Freedman |l973j gives that 



$jv). 



'($ N > e$Ar + K N ) < exp (-e 2 $ 



N) 



'(K N >e$ N + $ N ) < exp - 



1 



N 



So 



>e) < 2exp(-2e 2 $ A r) 
< ccxp(-2e 2 iV) 



for some constant c and sufficiently large N by Lemmas 0] and [5] The last 
expression is summable in N, and Borel-Cantelli holds. 

The proof that J2k<j ^N,k ~' J2k<j follows the same argument. □ 

It remains to show that we obtain Type I and II power laws in our special case 
of the three-parameter beta process, which implies a particular rate measure v 
in the Poisson process representation of the {qi}. For the three-parameter beta 
process density in Eq. (|15j) . we have 



/^i[0,.t] = / u VBp{dif>,du) 
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7 ' 
7- 
7- 



r(i 



r(i - a)r(6> + a) 

r(i + e) 
r(i-o)r(e + a) y 

r(i + 0) 



U - Q (i -u) 0+a - 

u~ a du, x 4- 
1 



G?1t 



r(l - a)r(6> + a) 1-af 



The final line is exactly the form required by Eq. (|2"T|) in Proposition [31 with 
l(y) equal to the constant function of value 



C := 



T(l 



7 



a r(l - a)r(0 + a) 



(28) 



Then Proposition [3] implies that the following power laws hold for the mean 
of the Poissonizcd process: 



*(*) a 

Lemma [5] further yields 



r(l -a)Ct a , t^oo 
ar(j - a) 



oo (j > 1). 



N 



r(l-a)CiV Q , AT->-oo 

<*r(j - a) 



-CW a , AT->oo (j > 1), 



and finally Proposition [5] implies 



A" 



AOv a ~ T(l-a)CN a , N 
a. s . dT(j - a) 



N,j 



CN a , N^oo (j > 1). 



(29) 
(30) 



These are exactly the desired Type I and II power laws (Eqs. (|9|) and (|T0|) ) for 
appropriate choices of the constants. 

6.2 Exponential decay in the number of features 

Next we consider a single data point and the number of features which are 
expressed for that data point in the featural model. We prove results for the 
general case where the ith feature has probability qi > such that J7 qi < oo. 
Let Zi be a Bernoulli random variable with success probability qi and such 
that all the Zi a re independent. Then E[y7 Zj] = y7 =: Q. In this 
Chernoff bound Chernofll . 1952 . Hagerup and Rub . ll990| tells us that, for any 
5 > 0, we have 



[J2Z< > (l + S)Q}<e 5 Q(l + 5)- (1 



-(i+S)Q 
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When M is large enough such that M > Q, we can choose 5 such that (1+S)Q = 
M . Then this inequality becomes 



P[JT Zi > M] < e M ~ Q Q M M~ M for M > Q. (31) 

i 

We see from Eq. (f3"Tj) that the number of features J^i Zi that are expressed 
for a data point exhibits super-exponential tail decay and therefore cannot have 
a power law probability distribution when the su m of feature probabilities Y*, ; qt 



is finite. For comparison, let Z ~ Pois(Q). Then [Franceschetti et all 12007 1 

F[Z > M] < e M - Q Q M M~ M for M > Q, 

the same tail bound as in Eq. (|3"Tj) . 

To apply the tail-behavior result of Eq. (|3lj) to the beta process (with two 
or three parameters), we note that the total feature probability mass is finite 
by Eq. P^)l . Since the same set of feature probabilities is used in all subsequent 
Bernoulli process draws for the beta-Bernoulli process, the result holds. 



7 Simulation 

To illustrate the three types of power laws discussed above, we simulated beta 
process atom weights under three different choices of the discount parameter 
a, namely a = (the classic, two-parameter beta process), a = 0.3, and a = 
0.6. In all three simulations, the remaining beta process parameters were kept 
constant at total mass parameter value 7 = 3 and concentration parameter 
value 6 = 1. 



The simulations were carried out using our extension of the iPaislev et al 



2010} stick-breaking construction in Eq. (|Ti|) . Wc generated 2,000 rounds of fea- 



ture probabilities; that is, we generated 2,000 random variables Ci and Y^i=i° Q 
feature probabilities. With these probabilities, we generated N = 1,000 data 
points, i.e., 1,000 vectors of (2,000) independent Bernoulli random variables 
with these probabilities. With these simulated data, we were able to perform 
an empirical evaluation of our theoretical results. 

Figure [5] illustrates power laws in the number of represented features Kn on 
the left (Type I power law) and the number of features represented by exactly 
one data point if/v,i on the right (Type II power law). Both of these quantities 
are plotted as functions of the increasing number of data points N. The blue 
points show the simulated values for the classic, two-parameter beta process 
case with a = 0. The center set of black points in each case corresponds to 
a = 0.3, and the upper set of black points in each case corresponds to a = 0.6. 

Wc also plot curves obtained from our theoretical results in order to compare 
them to the simulation. Recall that in our theoretical development, we noted 
that there are two steps to establishing the asymptotic behavior of Kn and 
Kn,j as N increases. First, we compare the random quantities Kn and Kn.] to 
their respective means, $at and These means, as computed via numerical 
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Feature count growth with N Feature count growth with N 
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Figure 5: Growth in the number of represented features Kn (left) and the 
number of features represented by exactly one data point Km,i {fight) as the 
total number of data points N grows. The points in the scatterplot are derived 
by simulation; blue for a = 0, center black is a = 0.3, and upper black for 
a = 0.6. The red lines in the left plot show the theoretical mean $jv (Eq. ([HP ; 
in the right plot, they show the theoretical mean $jv,i (Eq. (pM]) ). The green 
lines show the theoretical asymptotic behavior, Eq. (f2T)| on the left (Type I 
power law) and Eq. (|30p on the right (Type II power law). 



quadrature from Eq. (|24p and directly from Eq. (|26|) . are shown by red curves 
in the plots. Second, we compare the means to their own asymptotic behavior. 
This asymptotic behavior, which we ultimately proved was shared with the 
respective Km or -Kjvj m Eqs. (|29|) and ([30]) , is shown by green curves in the 
plots. 

We can see in both plots that the a = behavior is distinctly different from 
the straight-line behavior of the a > examples. In both cases, we can see that 
any growth in a is slower than can be described by straight-line growth. In 
particular, when a = 0, the expected number of features is 



>N 



= E[K N ] = E 



N 

E 

n=l 



Pois 



7: 



N 

E 

n=l 



7: 



7 01og(/V). (32) 



Similarly, when a — 0, the expected number of features represented by exactly 
one data point, Kn,i, is (by Eq. (|26[0 



<I> 



NA 



E[K 



N6 



N,l 



N 
1 

r(i)r(jv 



i 



^(l 
■0) 



x) 



N-l 



N 



N - 1 



T{N + 9) 

where the second line follows from using the normalization constant of the 
(proper) beta distribution. Interestingly, while K^ y \ grows as a power law when 
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Figure 6: Left: Change in the number of features with exactly j representatives 
among N data points for fixed N as a function of j. The blue points, with 
connecting lines, are for a = 0; middle black are for a = 0.3, upper black 
for a = 0.6. The green lines show the theoretical asymptotic behavior in j 
(Eqs. ([TO]) and (fTTj) ) for the two a > cases. Right: Change in the number of 
data points, indexed by n, with number of feature assignments k n greater than 
some positive, real- valued M as M increases. Neither the a = case (blue) nor 
the a > cases (black) exhibit Type III power laws. 

a > 0, its expectation is constant when a = 0. While many new features are 
instantiated as N increases in the a = case, it seems that they are quickly 
represented by more data points than just the first one. 

Type I and II power laws are somewhat easy to visualize since we have 
one point in our plots for each data point simulated. The behavior of Kjyj 
as a function of j for fixed N and type III power laws (or lack thereof) are 
somewhat more difficult to visualize. In the case of Knj as a function of j, wc 
might expect that a large number of data points N is necessary to sec many 
groups of size j for j much greater than one. In the Type III case, we have 
seen that in fact power laws do not hold for any value of a in the beta process. 
Rather, the number of data points exhibiting more than M features decreases 
more quickly in M than a power law would predict; therefore, we cannot plot 
many values of M before this number effectively goes to zero. 

Nonetheless, Figure [6] compares our simulated data to the approximation 
of Eq. (Uni) with Eq. ((TTJ> (left) and Type III power laws (right). On the left, 
blue points as usual denote simulated data under a = 0; middle black points 
show a = 0.3, and upper black points show a = 0.6. Here, we use connecting 
lines between plotted points to clarify a values. The green lines for the a > 
case illustrate the approximation of Eq. (fTT|) . Around j = 10, we see that the 
number of feaures exhibited by j data points, Kjyj, degenerates to mainly zero 
and one values. However, for smaller values of j we can still distinguish the 
power law trend. 
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Ordered feature frequencies 




Figure 7: Feature probabilities from the beta process plotted in decreasing size 
order. Blue points represent probabilities from the a = case; center black 
points show a = 0.3, and upper black points show a = 0.6. The green lines 
show theoretical asymptotic behavior of the ranked probabilities (Eq. (|3"3"|) ). 



On the right-hand side of Figure [6l we display the number of data points 
exhibiting more than M features for various values of M across the three values 
of a. Unlike the previous plots in Figure [5] and Figure [6j there is no power- law 
behavior for the cases a > 0, as predicted in Section HT2l We also note that 
here the a = 0.3 curve does not lie between the a = and a = 0.6 curves. Such 
an occurrence is not unusual in this case since, as we saw in Eq. pip , the rate 
of decrease is modulated by the total mass of the feature probabilities drawn 
from the beta process, which is random and not necessarily smaller when a is 
smaller. 

Finally, since our experiment involves generating the underlying feature 
probabilities from the beta process as well as the actual feature assignments 
from repeated draws from the Bernoulli process, we may examine the feature 
probabilities themselves; see Figure [7] As usual, the blue points represent the 
classic, two-parameter (a = 0) beta process. Black points represent a = 0.3 
(center) and a = 0.6 (upper). Perhaps due to the fact that there is only the 
beta process noise to contend with in this aspect of the simulation (and not the 
combined randomness due to the beta process and Bernoulli process), we see 
the most striking demonstration of both power law behavior in the a > cases 
and faster decay in the a = case in this figure. The two a > cases clearly 
adhere to a power law that may be predicted from our results above and the 
Gnedin et al. I j2007j results with C as in Eq. 



#{« : qi > x} °~ Cx~ a x 10. (33) 

Note that ranking the probabilities merely inverts the plot that would be cre- 
ated with x on the horizontal axis and {i : qt > x} on the vertical axis. The 
simulation demonstrates little noise about these power laws beyond the 100th 
ranked probability. The decay for a = is markedly faster than the other cases. 
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8 Experimental results 



We have seen that the Poisson process formulation allows for an easy extension of 
the beta process to a three-parameter model. In this section we stu dy this model 



empir ically in the setting of the modeling of handwritten digits. iPaislev et al 



2010} present results for this problem using a two-parameter beta process cou- 



pled with a discrete factor analysis model; we repeat those experiments with 
the three-parameter beta process. The data consists of 3,000 examples of hand- 
written digits, in particular 1,000 handwriting sample s of each of the digits 3, 5 



and 8 from th e MNIST Handwritten Digits database [LeCun and Cortesl . 11998 



RoweisL 2007 1 . Each handwritten digit is represented by a matrix of 28x28 pix- 



els; we project these matrices into 50 dimensions using principal components 
analysis. Thus, our data takes the form X G U 50x3000 ; and we may apply the 
beta process factor model from Eq. ([2]) with P = 50 and N = 3,000 to discover 
latent structure in this data. 

The generative model for X that we use is as follows [see IPaislev et af1 . l2010j : 



X = (WoZ)$ + E 

Z ~ BP-BcP(N,7,6»,a) 

$ k , p ^N(0,p p ) 

w„, fc ~ jv(o,o 

E n , p ~ N(0, V ), (34) 

with hyperparameters 9, a, 7, Bq, {p P }p = i, Ci V- Recall from Eq. ([2]) that X 6 
jjJVxp j g tnc data, $ G R KxP is a matrix of factors, and E G K JVxP is an error 
matrix. Here, we introduce the weight matrix W G M. NxK , which modulates 
the binary factor loadings Z G M> NxK . In Eq. (|34|) . o denotes elementwisc 
multiplication, and the indices have ranges n G {1, . . . , N}, k G {1, . . . , K},p G 
{1, . . . ,P}. Since we draw Z from a beta-Bernoulli process, the dimension K 
is theoretically infinite in the generative model notation of Eq. (|34|). However, 
we have seen that the number of columns of Z with nonzero entries is finite a.s. 
We use K to denote this number. 

We initialized both the two-parameter and the three-parameter models with 
the same number of latent features, K = 200, and the same values for all shared 
parameters (i.e., every variable except the new discount parameter a). We ran 
the experiment for 2,000 MCMC iterations, noting that the MCMC runs in both 
models seem to have reached equilibrium by 500 iterations (sec Figures [8] and 

El). 

Figures [8] and [9] show the sampled values of various parameters as a function 
of MCMC iteration. In particular, we see how the number of features K (Fig- 
ure [8]), the concentration parameter 9, and the discount parameter a (Figure [9| 
change over time. All three graphs illustrate that the three-parameter model 
takes a longer time to reach equilibrium than the two-parameter model (ap- 
proximately 500 iterations vs. approximatively 100 iterations). However, once 
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Figure 8: The number of latent features K as & function of the MCMC iteration. 
Results for the original, two-parameter model are represented on the left, and 
results for the new, three-parameter model arc illustrated on the right. 
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Figure 9: The random values drawn for the hyperparameters as a function 
of the MCMC iteration. Draws for the concentration parameter 9 under the 
two-parameter model are shown on the left, and draws for 8 under the three- 
parameter model are shown in the middle. On the right are draws of the new 
discount parameter a under the three-parameter model. 
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Figure 10: Autocorrelation of the number of factors K , concentration parameter 
(9, and discount parameter a for the MCMC samples after burn-in (where burn- 
in is taken to end at 500 iterations) under the two-parameter model (left) and 
three-parameter model (right). 



at equilibrium, the sampling time series associated with the three-parameter 
iterations exhibit lower autocorrelation than the samples associated with the 
two-parameter iterations (Figure [TU]). In the implementation of both the orig- 
inal two-parameter model and the three-parameter model, the range for 9 is 
considered to be bounded above by approximately 1 00 for computationa l rea- 
sons (in accordance with the original methodology of Paisley et al. }2010l |). As 
shown in Figure |9l this bound affects sampling in the two-parameter experiment 
whereas, after burn-in, the effect is not noticeable in the three-parameter exper- 
iment. While the discount parameter a also comes close to the lower boundary 
of its discretization (Figure |9]) — which cannot be exactly zero due to computa- 
tional concerns — the samples nonetheless seem to explore the space well. 

We can see from Figure [10] that the estimated value of the concentraton 
parameter 9 is much lower when the discount parameter a is also estimated. 
This behavior may be seen to result from the fact that the power law growth 
of the expected number of represented features $at in the a > case yields a 
generally higher expected number of features than in the a = case for a fixed 
concentration parameter 9. Further, we see from Eq. (|32p that the expected 
number of features when a = is linear in 9. Therefore, if we instead fix the 
number of features, the a — model can compensate by increasing 9 over the 
a > model. Indeed, we see in Figure [8] that the number of features discovered 
by both models is roughly equal; in order to achieve this number of features, 
the a = model seems to be compensating by overestimating the concentration 
parameter 9. 

To get a sense of the actual output of the model, we can look at some of 
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Figure 11: Upper. The top nine features by sampled representation across the 
data set on the final MCMC iteration for the original, two-parameter model. 
Lower. The top nine features determined in the same way for the new, three- 
parameter model. 



the learned features. In particular, we collected the set of features from the last 
MCMC iteration in each model. The fcth feature is expressed or not for the 
nth data point according to whether Z n k is one or zero. Therefore, we can find 
the most-expressed features across the data set using the set of features on this 
iteration as well as the sampled Z matrix on this iteration. We plot the nine 
most-expressed features under each model in Figure [TTJ In both models, we can 
see how the features have captured distinguishing features of the 3, 5, and 8 
digits. 

Finally, we note that the three-parameter version of the algorithm is com- 
petitive with the two-parameter version in running time once equilibrium is 
reached. After the burn-in regime of 500 iterations, the average running time 
per iteration under the three-parameter model is 14.5 seconds, compared with 
11.7 seconds average running time per iteration under the two-parameter model. 



9 Conclusions 

W e have shown that t he stick-breaking representation of the beta process due 
to Paisley et al. 201d| can be obtained directly from the representation of the 



beta process as a completely random measure. With this result in hand the set 
of connections between the beta process, stick-breaking, and the Indian buffet 
process are essentially as complete as those linking the Dirichlet process, stick- 
breaking, and the Chinese restaurant process. 

We have also shown that this approach mo tivates a three-param eter general- 
ization of the stick- breaking representation of Paisley et al. |2010| , which is the 



analog of the Pitman- Yor generalization of the stick-breaking representation for 
the Dirichlet process. We have shown that Type I and Type II power laws follow 
from this three-parameter model. We have also shown that Type III power laws 
cannot be obtained within this framework. It is an open problem to discover 
useful classes of stochastic processes that provide such power laws. 
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A A Markov chain Monte Carlo algorithm 

Posterior inference under the three-parameter model can be performed with a 
Markov chain Monte Carlo (MCMC) algorithm. Many conditionals have simple 
forms that allow Gibbs sampling alth ough others require fu rther approximation. 
Most of our sampling steps are as in IPaislev et al. 2010j with the notable ex- 



ceptions of a new sampling step for the discount parameter a and integration 
of the discount parameter a into the existing framework. We describe the full 
algorithm here. 

A.l Notation and auxiliary variables 



Call the index i in Eq. (|14[) the round. Then introduce the round-indicator 
variables r k such that r k = i exactly when the fcth atom, where k indexes the 
sequence (V>i,i) • ■ • ,ipi,Ci>ip2,ii ■ ■ ■ I - 02,c 2 ? • ■ •)> occurs in round i. We may write 



00 



■1=1 {j=i 

To recover the round lengths C from r = (r\, r%, . . .), note that 



C< = 5^1(r fc =i). (35) 
fe=i 

With the definition of the round indicators r in hand, we can rewrite the 
beta process B as 

00 r k 

fe=i j=i 

where V k .j *~ Beta(l— a, 9+ia) and ipk '~ 7 _1 ^o as usual although the indexing 
is not the same as in Eq. ([Til) . It follows that the expression of the A:th feature 
for the nth data point is given by 

r h -l 

Z n ,k ~ Bern (ir k ) , n k := V k: r k JJ (1 - V k , 3 ). 
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We also introduce notation for the number of data points in which the fcth 
feature is, respectively, expressed and not expressed: 

N N 

m ltk := MZn.k = 1), m , k ■= ^ M z n,k = 0) 

n— 1 n—1 

Finally, let K be the number of represented features; i.e., K := j^{k : m^i > 0}. 
Without loss of generality, we assume the represented features are the first K 
features in the index k. The new quantities {r k }, {wii^}, {wo,fe}i and K will 
be used in describing the sampler steps below. 



A. 2 Latent indicators 

First, we describe the sampling of the round indicators {r k } and the latent 
feature indicators {Z n ^}- In these and other steps in the MCMC algorithm, we 
integrate out the stick-breaking proportions {Vi}. 



A. 2.1 Round indicator variables 

We wish to sample the round indicator r k for each feature k with 1 < k < K. 
We can write the conditional for r k as 

P{rk = i\{n}th{Z n ,k}^xAa,l) 

cx p({Z n>k }% =1 \r k = i, 6, a)p(r k = i\{n}^). (36) 

It remains to calculate the two factors in the product. 

For the first factor in Eq. (|36j) . wc write out the integration over stick- 
breaking proportions and approximate with a Monte Carlo integral: 

p{{Z n>k }^ 1 \r k = i,6,a) = I ^ 1 '"{l--n k ) m ^ dV 

* ^E^r^d-^r "- (37) 

s — l 

Here, tt^ := V$ k I^Cl"^), and V$ ^ Beta(l - a, 0+ja). Also, S is 
the number of sa mples in the sum app roximation. Note that the computational 
trick employed in IPaislev et al.l (2010| for sampling the {Vi} relies on the first 
parameter of the beta distribution being equal to one; therefore, the sampling- 
described above, without further tricks, is exactly the sampling that must be 
used in this more general parameterization. 

For the second factor in Eq. (|36[) . there is no dependence on the a parameter, 
so the draws are the same as in IPaislev et ai1|2010| . For R k := £)* =1 Ifo = r k ), 
we have 
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r < r*_i 
r = rk-i 

Pois(0|7)) Pois(0|7)' 1 " 1 r = r fe _i + h 

for each h > 1. Note that these draws make the approximation that the first 
if features correspond to the first if tuples in the double sum of Eq. (fT3)| ; 
these orderings do not in general agree. 

To complete the calculation of the posterior for r^, we need to sum over 
all values of i to normalize p{rk = i\{u}i=i ■, {Z n ,k}n=\i a i l)- Since this is 
not computationally feasible, an alternative method is to calculate Eq. (|36|) for 
increasing values of i until the result falls below a pre-determined threshold. 







i r E, H = k r 1_1 poKii7) 

1-EtV 1 Pois(»| 7 ) 

i-Efir I_1 p°is(ii7) 



1 



(1 



A. 2. 2 Factor indicators 

In finding the posterior for the fcth feature indicator in the nth latent factor, 
Z n ,k, we can integrate out both {Vi} and the weight variables {W^i.fc}- The 
conditional for Z n ^ is 

p(Z nt k\X nt ., $, Z n - k , r, 6, a, <q, Q 

= p(X ni .\Z nt .,$,n,Qp(Zn,k\r,0,a,Z nt - k ). (38) 

First, we consider the likelihood. For this factor, we integrate out W explic- 
itly: 

p(X nr \Z n ,.,$,ri,Q 

p(X n> .\Z nt .,$,W,ri)p(W\Q 

w 

N{X^W n j^ I ..,r,Ip)N(W n . I \Q w ,CI\i\)dW nJ 
where I — {i : Z n j = 1} 



= N X n \0p, 



v ip - " */,■ vn~ $i + C i\i\) * 



N(X ni .\0 P ,T]Ip + 



where the final step follows from the Sherman-Morrison- Woodbury lemma. 
For the second factor in Eq. (|3"5)) . we can write 

(r, I a 7 \ P( z n\r,9,a) 

p{Z n k \r, (J, a, /„ k ) = —— : — - — r, 

p(Z n _ k \r, 0, a) 

and the numerator and denominator can both be estimated as integrals over V 
using the same Monte Carlo integration trick as in Eq. (f3"T|). 
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A. 3 Hyperparameters 



Next, we describe sampling for the three parameters of the beta process. The 
mass and concentration parameters are shared by the two-parameter process; 
the discount parameter is unique to the three-parameter beta process. 

A. 3.1 Mass parameter 

With the round indicators {r^} in hand as from Appendix IA.2.11 above, we can 
recover the round lengths {Ci} with Eq. (|35[) . Assuming an improper gamma 
prior on 7 — with both shape and inverse scale parameters equal to zero — and 
recalling the iid Poisson generation of the {Ci}, the posterior for 7 is 

rK 

p{ 1 \r,Z : 6,a) = G a { 1 \Y,Ci,r K ). 

i=i 

Note that it is necessary to sample 7 since it occurs in, e.g., the conditional for 
the round indicator variables (Appendix IA.2.11) . 

A. 3. 2 Concentration parameter 

The conditional for 9 is 

K 

p(6\Z,r,a) cxp(6)l[p(Z\r,6,a). 

k=l 

Again, we calculate the likelihood factors p(Z\r, 8, a) with a Monte Carlo ap- 
proximation as in Eq. (|37[) . In order to find the conditional over 9 from the like- 
lihood and prior, we further approximate the space of 9 > by a discretization 
around the previous value of 9 in the Monte Carlo sampler: {9 prev + tA9Y t z^, 
where 5* and T are chosen so that all potential new 9 values are nonnegativc 
and so that the tails of the distribution fall below a pre-determined threshold. 
To complete the description, we choose the improper prior p(9) cx 1. 

A. 3. 3 Discount parameter 

We sample the discount parameter a in a similar manner to 9. The conditional 
for a is 

K 

p(a\Z,r,8) cx p(a) Y[p(Z\r,6,a). 

k=l 

As usual, we calculate the likelihood factors p(Z\r,8, a) with a Monte Carlo 
approximation as in Eq. Q37p. While we discretize the sampling of a as we did 
for 9, note that sampling a is more straightforward since a must lie in [0,1]. 
Therefore, the choice of Aa completely characterizes the discretization of the 
interval. In particular, to avoid endpoint behavior, we consider new values of 

a among {Aa/2 + iAajj^Q ' 1 . Moreover, the choice of p(a) cx 1 is, in this 
case, a proper prior for a. 
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A. 4 Factor analysis components 



In order to use the beta process as a prior in the factor analysis model described 
in Eq. ([5]), we must also describe samplers for the feature matrix 4> and weight 
matrix W . 

A. 4.1 Feature matrix 

The conditional for the feature matrix $ is 

p(<$>. tP \X,W,Z,ri,p p ) cx p(X. iP \®., p ,W,Z,t]In)p(®.,p\Pp) 

= N{X., P \(W o Z)$., p ,r!l N )N($., p \0 K ,p p I K ) 
cx TV 

where, in the final line, the variance is defined as follows: 

E := (tTW o Z) t (W o Z) + p^IkT 1 , 
and similarly for the mean: 

p, := S? ? " 1 (VKoZ) T X,p. 

A. 4. 2 Weight matrix 

Let I = {i : Z n % = 1}. Then the conditional for the weight matrix W is 

p(W n ,i\X,Z,$,r)) cx p{Xn,.\9i,.,W n ,i,Ti)p{Wn,i\0 

= N(X n ,.\Wn,i$i,.,riI p )N(W n ,i\0\i\,CI\i\) 
cx N(W n ,i\fi,t), 

where, in the final line, the variance is defined as E := (?7 _1 $j j .$J. + , 
and the mean is defined as p := Ti-q~ 1 X n ^.<tJ .. 
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